Alignstein: Optimal transport for improved LC-MS retention time alignment

Abstract Background Reproducibility of liquid chromatography separation is limited by retention time drift. As a result, measured signals lack correspondence over replicates of the liquid chromatography–mass spectrometry (LC-MS) experiments. Correction of these errors is named retention time alignment and needs to be performed before further quantitative analysis. Despite the availability of numerous alignment algorithms, their accuracy is limited (e.g., for retention time drift that swaps analytes’ elution order). Results We present the Alignstein, an algorithm for LC-MS retention time alignment. It correctly finds correspondence even for swapped signals. To achieve this, we implemented the generalization of the Wasserstein distance to compare multidimensional features without any reduction of the information or dimension of the analyzed data. Moreover, Alignstein by design requires neither a reference sample nor prior signal identification. We validate the algorithm on publicly available benchmark datasets obtaining competitive results. Finally, we show that it can detect the information contained in the tandem mass spectrum by the spatial properties of chromatograms. Conclusions We show that the use of optimal transport effectively overcomes the limitations of existing algorithms for statistical analysis of mass spectrometry datasets. The algorithm’s source code is available at https://github.com/grzsko/Alignstein.

Narodowe Centrum Nauki (2018/29/B/ST6/00681) prof. Anna Gambin prof. Błażej Miasojedow Abstract: Background Reproducibility of liquid chromatography separation is limited by retention time drift. As a result, measured signals lack correspondence over replicates of the liquid chromatography mass spectrometry (LC-MS) experiments. Correction of these errors is named retention time alignment and needs to be performed before further quantitative analysis. Despite the availability of numerous alignment algorithms, their accuracy is limited, e.g. for retention time drift that swaps analytes' elution order.

Results
We present the Alignstein, an algorithm for liquid chromatography-mass spectrometry retention time alignment. It correctly finds correspondence even for swapped signals.
To achieve this, we implemented the generalization of the Wasserstein distance to compare multidimensional features without any reduction of the information or dimension of the analyzed data. Moreover, Alignstein by design requires neither a reference sample nor prior signal identification. We validate the algorithm on publicly available benchmark datasets obtaining competitive results. Finally, we show that it can detect the information contained in the tandem mass spectrum by the spatial properties of chromatograms.

Conclusions
We show that the use of optimal transport effectively overcome the limitations of existing algorithms for statistical analysis of mass spectrometry datasets. The algorithm's source code is available at https://github.com/grzsko/Alignstein.
When we were designing Alingstein, we aimed to take advantage of incorporating all features' signal information (span over RT and isotopic envelope) by its shape and the distribution of all signals.
We mentioned Quandenser in a paragraph listing other feature-matching algorithms. We address the same issues to Quandenser as raised in the original manuscript, i.e.
"To the best of the authors' knowledge, all matching algorithms reduce multidimensional features to one-dimensional extracted ion chromatograms or a single points with monoisotopic peak M/Z and average RT value, ignoring the information of isotopic envelope or feature span over the RT dimension. Without feature spatial characteristics and information of coeluting ions, elution order swaps are practically undetectable." Quantification is an important task. However, Alignstein is not designed as a quantification tool. The goal of Aligstein is to make the best possible RT alignment before quantification. Extending it to quantification benchmarking is out of the scope of the current paper.
Moreover, the proposed work benchmarks quantification techniques but it does not document RT alignment. Thus, we cannot derive any ground truth from this work and it does not allow for accurate RT benchmarking.
We described alignment as a preprocessing step of quantification as detailed in answer to Major 1.
> 5.I am very skeptical of evaluating performance on swapped signal with > simulated data. I would urge the authors to seek further validation that does > not involve simulated data. Or at least explain in detail why this is not > possible to perform on experimentally acquired data.
The reason for using simulated data are limits of the other benchmarking techniques: limited ground-truth coverage of input datasets. Usually, we cannot correctly identify even half of our input data. Thus, we cannot check the accuracy of the alignment of features not covered by the ground truth. We checked the general accuracy of alignment on several real datasets: P1, P2, Mussels, and additionally, after revision, M1. Simulated data gives us the full information about the true correspondence. We used it only for benchmarking swaps detection, which affects only a fraction of features and thus is prone to ground-truth incompleteness. We modified this statement as follows (+ a preceding sentence for readability): '(...) designing algorithms for efficient and precise analysis of LC-MS datasets remains challenging.
One of these challenges is the correction of errors caused by retention time (RT) drift.' > 2.The following sentence is grammatically incorrect and I think it does not > add anything to the better comprehension of the subject or problem at hand: > "Also, inaccuracies of the detector in mass spectrometer may result in variable > mass-tocharge ratio (M/Z) measurements, but in comparison to RT, the required > correction is of essentially lower order of magnitude.". I suggest to remove > this sentence.
As suggested, we removed this sentence.  Although OT theory has over 50 years of history, its modern approach dates back to 2013 when Marco Cuturi proposed OT computation with the Sinkhorn-Knopp algorithm [1]. The application of optimal transport in life sciences is a case of the last 3-4 years since Chizat et al. work in 2018 (reference 40), where they proposed a technique of approximating the transport plan (cf. [2] or [3]).
Accordingly, interest in applying OT in MS is indeed very recent. Papers, which we cited in this work (references 27,32,34,35), are to the best of our knowledge the only works of applying OT in MS, except one older, preliminary work written by our group [4] and two another from 2021: one more theoretical [5], one for elution profiles [6]. We added [6] to references (33) to show a more broad interest in OT in MS.

_Reviewer 2_
> Comments: > Skoraczyński et al describe an algorithm for retention time alignment which > makes use of the generalized Wasserstein distance. This is beneficial because > it allows to also consider peak shape and isotopic distribution to find > corresponding features in MS runs. Based on several benchmark data sets the > authors demonstrate comparable performance as achieved by the best algorithms > currently available. > The article is well written, and the authors managed to find an excellent > balance for mathematical detail presented which makes the article good to read > despite the complicated subject. > > Major Comments: > -The biggest data set the authors use in their benchmarks is the CAAP P2 > data set which comprises 5 fractions run in triplicates. Precision achieved > here is much lower than in the smaller P1 data set. In the discussion the > authors point out that their algorithm has difficulties with distant features. > Obviously, retention time deviations tend to increase in larger studies. To > evaluate if this might be an issue constraining the performance of the > presented algorithm, the authors should also benchmark on a larger data set. > This will also be of high interest in times where the size of proteomics > studies continuously grows.
The precision drop from duplicate P1 set to triplicate P2 set is observable for all algorithms. It is because finding the feature correspondence for two sets is relatively easy (e.g. MassUntangler is implemented only for pairwise alignment, Alignstein has a special, simplified implementation for 2 chromatograms). However, finding feature matching for more than two chromatograms is significantly more difficult, both for conceptual (So what formal problem we are solving?) and implementation sides (and how to do it?).
To benchmark on a larger dataset, we added the description of benchmarking on the M1 dataset from the CAAP study. It consists of 44 metabolomic chromatograms. Initially, we focused on proteomics and omitted this set due to its data dump incompleteness. Based on data provided by the authors, the evaluation script would not give a clear answer if Alignstein performs better or worse than the other tools. Now, we resolved this obstacle by detecting features de novo. We used the tool and parameters from the original study (XCMS, method = "centWave", peakwidth = c(20, 50), snthresh = 5, ppm = 12). We matched newly detected features with the original ones and ran the evaluation script for such matched features. Results are competitive (precision 0.91, recall 0.88, F-score 0.89), which shows not only good scalability but also that Alignstein is not limited to proteomics. We reported the results in Table 1. Feature de novo detection is described in Suppl. Mat. section 4: 'M1 dataset lacks the spatial information of analyzed features. For this reason, we reproduced feature detection using XCMS3 with parameters detailed in the CAAP study. We matched features from the study with newly generated features by checking if the previous feature representation falls within the bounding box of new features. Such matched features were further input of the evaluation script.' > Minor Comments: > -The authors should mention the run time of the algorithm on the individual > benchmark data sets and the computer on which the benchmarks were performed.

Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation
Good remark, we measured runtimes and presented results in Table 2. We added the following details of the computer in the Methods sections: 'Computation was done on a computer with Linux operating system and 24 Intel Xeon E5-2620 2.10 GHz processors + 62 GB RAM. We measured wall time using Linux built-in time command.' > - Figure 1 in Box Clustering: hierarchical *clustering* Corrected.
> -Page 7 after formula (1): where d(x, y) is *a distant* between peaks x and > y; should be *the distance* Corrected.
> -Page 8: We used input chromatograms as mzML and mzXML files and features as > featureXML files provided by authors, as well as *their?* alignment precision > and alignment recall evaluation script We rewrote this sentence as follows: 'We used input chromatograms as mzML and mzXML files and features as featureXML files provided by authors of the CAAP study. We measured alignment precision and recall using an evaluation script written in R programming language by the authors of this study.' > -Methods: as the paper deals with retention time alignment, the authors > should also provide LC parameters used in the sample data sets (precolumn?, > solvents, gradient, column, flow rate) We added the following information to 'Mussels toxicological response experiment summary' Section in Methods: 'Acclaim PepMap C18 nano column 75 μm × 25 cm, 3 μm, 100 Å plus bypass, a linear gradient of 96% buffer A (0.5% Acetic Acid) and 4% buffer B (80% acetonitrile in 0.5% acetic acid) to 60% buffer A and 40% buffer B, a flow rate of 300 ml/min for 120 minutes).'

Introduction
Advances in liquid chromatography mass spectrometry (LC-MS) have provided a remarkable insight into the functioning of the organisms, ranging from protein level [1], through tissue [2] to environmental networks [3]. All of these research studies benefit from the possibility to separate complex mixtures in the liquid chromatographic column and then measure the analytes with high throughput mass spectrometer. Although LC-MS systems provide precise answers to both quantitative and qualitative biological and medical questions, designing algorithms for efficient and precise analysis of LC-MS datasets remains challenging.
One of these challenges is the correction of errors caused by retention time (RT) drift. It limits the reproducibility of LC separation, which is important for experiments usually acquired in many (even hundreds) replicates. RT drift became a significant obstacle with the emergence of high-performance chromatography (HPLC) and ultra-performance chromatography (UPLC) technologies. For example, nanoflow UPLC column separation takes a relatively long time, usually up to several hours. For these experiments, the elution time of peptides may vary up to 5 minutes [4] or even 10 minutes [1].
RT drift can be corrected by the experimental protocol only to a limited extent [5]. It may change the whole gradient or affect only single peaks. These changes may be caused by various reasons such as the unstable mobile phase, the column  change or degradation, sample chemical instability, or imprecise experiment setup [6,7,8].
RT drift requires a correction, usually named the RT alignment. It results in the correspondence of signals across runs [9]. For example, in proteomics, the signal correspondence of the same peptides is needed for further applying labelfree quantification (LFQ) for which samples must be measured separately [10,11]. Moreover, for LFQ techniques, we cannot obtain the correspondence any other way because analytes do not have any additional information, such as metabolic labels, or chemical tags [12,13].
Here, we present a novel alignment algorithm named Alignstein (cf. Fig. 1). It finds the correspondence of initially detected features, i.e. convex sets of peaks representing the signal of a single analyte. It overcomes the limitations of currently existing algorithms and properly resolves the correspondence of analytes of swapped elution order. To achieve this, we take advantage of the generalization of the Wasserstein distance (GWD) [14] to compare multidimensional features. To obtain the most feasible alignment results, Alignstein has formulated a complex optimization signal-matching problem, for which we use clustering and network flow algorithms to achieve a computationally tractable outcome.
This article is organized as follows. First, we characterize Alignstein and analyze how it deals with the swapped signals. Then, we validate the algorithm on publicly available benchmark datasets. Finally, we show the applicability of our approach to detecting corresponding biomarkers in differing samples.

Findings
The problem: resolving swaps RT drift may swap the order of eluting analytes. In the proteomic experiment (cf. Methods), we analyzed that about 3 % of all feature pairs are swapped between two chromatograms. Although many of the available algorithms properly align most signals, still they fail to resolve swaps.
The vast majority of approaches to RT alignment are socalled warping algorithms, e.g. OpenMS [15], MetAlign [16], MZMine 2 [17], SIMA [18], the solution proposed by Zhang [19], DIAlignR [20], the solution proposed by Chiung-Ting Wu, et al. [21]. These algorithms consist of applying a warping function that transforms the chromatograms by shifting, stretching, and squeezing. These transformations result in a close distance between corresponding signals. After alignment, however, further feature detection and matching are still required to obtain the signal correspondence. These algorithms' applicability is limited because the warping function is applied under the assumption that ions elute monotonically with RT. Thus, they are not able to deal with elution order swaps.
Alternatively, a rarer implemented approach is feature matching, e.g. OpenMS [15] (both warping and matching algorithm), MassUntangler [22], LWBMatch [23], the solution proposed by Wandy, et al. [24], MS-Dial [25], Quandenser [26]. Algorithms by feature matching find the correspondence between initially detected features of two or more chromatograms. Corresponding features represent the same analyte and further will be referred to as consensus features. To the best of the authors' knowledge, all matching algorithms reduce multidimensional features to one-dimensional extracted ion chromatograms or a single points with monoisotopic peak M/Z and average RT value, ignoring the information of isotopic envelope or feature span over the RT dimension. Without feature spatial characteristics and information of coeluting ions, elution order swaps are practically undetectable [8]. The main reason for this simplification lies in the difficulty to find multidimensional feature dissimilarity measures. Typically, Euclidean distance between points or one-dimensional cosine-like spectra similarity scores is applied [27,28]. Although the limitations of these scores are known, still there is a shortage of their effective improvements [28,29].

The solution: the Alignstein algorithm
Alignstein is the RT alignment algorithm by feature matching that properly deals with features of swapped order. It is possible because the algorithm represents features by all signals contained within their boundaries. To cope with this representation, we use the generalization of the Wasserstein distance as a feature dissimilarity measure. It originates from the optimal transport theory and has been recently attracting growing attention to various problems of mass spectrometry [27,32,33,34,35]. Its design significantly differs from currently existing similarity scores and thus it overcomes the majority of their limitations. The Wasserstein distance describes the cost of the optimal way how to transform one feature into the other one. The transformations include not only shifting the signal from one feature to another but also splitting or combining the signal between peaks (cf. Fig. 2). The key strength of Wasserstein distance is the ability to compute features' similarity by their spatial shape (cf. Fig. 3). Moreover, it easily scales with dimension. Generalizing the Wasserstein distance allows comparing noisy features by introducing an appropriate penalty. This provides a highly flexible measure for effective computing feature distance and similarity.
Alignstein aligns chromatograms by finding consensus features. It is done in two phases (cf. Fig. 1): at first, feature centroids are clustered to find candidates for consensus features, which are then verified by the feature matching phase. During the latter phase, the algorithm computes the optimal feature matching, which represents the most similar feature pairs throughout all chromatograms (cf. Methods). We solve this problem by reducing it to finding the maximum flow of minimum cost in an appropriate flow network (cf. Fig. 4). Consensus features are then created from optimal feature matching with regard to initial centroid clustering. The such formulation allows for aligning chromatograms without a requirement for a reference sample or a prior feature identification. It also easily scales with a number of input chromatograms. Finally, this algorithm is not limited to correcting RT perturbations in repeated experimental runs, it also accurately aligns the majority of detected corresponding biomarkers from samples of different experimental treatments.

Dealing with swapped signal
We assessed that Alignstein properly matches swapped features. For this purpose, we collected over 580 identified features from the chromatograms obtained from Barranger et al.'s work [3] (see Methods). We simulated RT drift by randomly moving features within range (-150 s, 150 s) in the RT dimension and within range (-0.3 Da, 0.3 Da) in the M/Z dimension. These two sets of features: one with original features and the second with drifted features represented chromatograms to be aligned. For such a formulation, about 2 % (ca. 3400) of feature pairs were swapped. We aligned these two sets and measured a number of properly matched features and a fraction of properly resolved swapped feature pairs. Our tool matched practically all drifted features (96 %) and the vast majority of swapped feature pairs (91 %). We compared our results with two opensource feature matching algorithms: OpenMS, and LWBMatch. OpenMS had high feature matching precision, it matched the majority of drifted features (80 %). However, its accuracy drastically decreased when analyzing only swapped feature pairs (61 %). LWBMatch had a significantly lower matching precision, it matched 24 % of drifted features and only 3 % of swapped feature pairs. to GWD between them. Additional node Tr ('trash') gives the possibility to not match the feature with cost c. Every edge has capacity equal to 1, except edge between S (source) and Tr and edge between Tr and T (sink) with capacities equal to max{0, s -n} and max{0, n -s} respectively (at most one of them has nonzero capacity). Edges between R 1 , . . . , Rm and C 1 , . . . , Cs give the restriction that aby feature can be matched with at most one cluster. As a result, we take all matchings (L i , C k ). We recognize the consensus feature by its cluster.

Algorithm validation on benchmark datasets
We evaluated the accuracy of our method by comparing alignment quality on public benchmark datasets. We reproduced the evaluation protocol from Lange et al. study [36] (further referred to as the CAAP study). We analyzed two proteomic datasets from CAAP evaluation: P1 and P2, and one metabolomic: M1. P1 set contained the analysis of E. coli protein extracts and consisted of 6 fractions at different salt bumps, every fraction in 2 different runs. Analogously, P2 contained the analysis of protein extract from M. smegmatis in 5 fractions every in 3 replicated runs. M1 contained the analysis of leaf tissue extract from A. thaliana in 44 repeated runs. To assess the correctness of alignment algorithms, the authors of the CAAP study proposed alignment precision and alignment recall measures (cf. Methods). Moreover, as proposed by the authors of SIMA algorithm [18], we computed the F-score, which is a harmonic mean of alignment precision and recall.
We analyzed sets P1, P2, and M1 and compared Alignstein with the results of the OpenMS alignment algorithm [15] from the CAAP study. We chose OpenMS because it achieved significantly better results than the other tools and represented a state-of-the-art solution at the time of the original study. Moreover, we included in comparison the available results of algorithms published more recently: MZMine 2 [17], SIMA [18], MassUntagler [22] (only P1 set), and Wandy et al. [24]. We measured the time of alignment computation, results are presented in Table 2.
Alignstein obtained highly competitive results in CAAP evaluation. For the P1 dataset, it matched perfectly almost all features, its precision and recall were on average 0.94, similarly to MZmine 2 and OpenMS (cf. Table 1, Supplementary Table  S1). SIMA obtained slightly worse results and the rest of the tools obtained lower values than SIMA. Interestingly, all tools achieved average alignment precision and recall no higher than 0.94. It may suggest that improperly matched features either are too distant to be matched based on LC-MS information or ground-truth is misspecified.
For the P2 set, we achieved the highest average alignment recall (on average 0.82), i.e. our approach had a minimal num-

Application to the detection of specific biomarkers
Alignstein can detect specific biomarkers in medical applications or biological analysis. To verify this, we analyzed the dataset from Barranger et al.'s work [3]. It contained LC-MS/MS chromatograms of intestinal protein from marine mussels exposed in vivo to various benzo[a]pyrene (BaP) concentrations (0, 5, 50, 100 µg/L).
We checked if Alignstein recognizes tandem mass spectra (MS/MS) information by spatial properties of LC-MS features. To assess this, we detected LC-MS features and annotated them with peptide MS/MS identifications. The accuracy of alignment was quantified using proposed identification recall (IR) defined as follows. We chose all repeating identifications that have annotated features and computed a fraction of them that were properly aligned (cf. Methods). For every BaP concentration, we computed IR for all aligned technical replicates of the sample. We achieved satisfactory results IR equal to 81 %, 78 %, 85 %, 86 % respectively for BaP concentrations 0, 5, 50, and 100 µg/L. As a baseline, we repeated this analysis for the OpenMS algorithm, which achieved similar results with IR equal to 81 %, 76 %, 85 %, and 83 %. Moreover, we calculated the IR separately for every subset of all aligned chromatograms (see Methods). This demonstrated that our approach uniformly treats all chromatograms (cf. Fig. 5a and Supplementary Fig.  S1).
Moreover, we checked if Alignstein can detect corresponding biomarkers for LC-MS measurements of samples under different experimental conditions. For this purpose, we repeated the analysis above by aligning chromatograms across all BaP concentrations. The overall IR was equal to 85 %. Contrary to the previous experiment, IR for OpenMS has fallen to 0.75 %. Alignstein's results were analogously as earlier uniform over all chromatogram subsets (cf. Fig. 5b) with IR values not lower than 67 %, reaching even 100 % for some subsets of repeated identifications. This proves that, despite the varying experimental conditions, our solution is able to correctly align the vast majority of corresponding features without accuracy loss. Finally, this experiment shows that it may be applied as a tool for biomarkers screening in LC-MS analysis.

Discussion
Alignstein is a novel, original algorithm for LC-MS alignment based on the GWD feature dissimilarity measure. This allows for incorporating not only distances between features, but also their spatial differences and thus more accurate feature alignment. The GWD emerges to be a key solution for correctly aligning signals with a swapped elution order, as demonstrated in the previous sections.
In addition to correctly resolving feature swaps, Alignstein has more advantages over the majority of alignment algorithms. It requires no prior feature identification, so LC-MS data without additional tandem mass spectra suffice as input to the algorithm. Moreover, our approach makes no assumptions about the characteristics of the analyzed chromatograms, so it is not limited to one type of data (e.g. proteomic or metabolomic). Still, specific properties of the analyzed data (e.g. maximum drift size) can be passed as algorithm parameters. Finally, it treats uniformly all analyzed chromatograms, and thus it does not require a reference sample.
Alignstein requires only the prior feature detection as a data preprocessing step. Although approaches with this requirement are criticized [8,20], we argue that the analysis with detected features is more accurate than the analysis of raw chromatograms. Properly executed feature detection effectively discriminates regions of high signal-to-noise ratio from chro- matograms [37]. Moreover, multidimensional feature detection is crucial for collecting information about coeluting ions (e.g. isotopic envelopes of compounds). Without this, any alignment algorithm might yield inaccurate results by aligning signals across isotopic envelopes.
Besides advantages, Alignstein has also limitations. It correctly matches the vast majority of features, but it happens to fail to match distant features. This mismatch can be explained by interpreting GWD as a sum of two costs: the cost of transporting the feature along the RT (to eliminate drift) and the cost of transformation (to incorporate feature-feature spatial differences). For a pair of distant, corresponding features, the cost of transport along the RT far exceeds the cost of transformation. For this reason, even highly dissimilar but much closer features may camouflage the correct feature correspondence. This can be particularly troublesome for complex datasets having a significant number of features, which are densely packed within chromatograms. This limitation can be only partially corrected by adjusting GWD parameters because the majority of corresponding feature pairs have RT differences of less than 10 seconds (cf. Supplementary Fig. S2) and thus the GWD parameters must be optimized for small feature distances. One of the possible solutions is to incorporate additional information for alignment, for example, MS/MS data. Thus, we plan to extend our algorithm to deal with LC-MS/MS datasets in a data-independent acquisition mode.
In conclusion, Alignstein correctly aligns chromatograms as we have shown in the biomarkers detection experiment, by reproducing the CAAP evaluation study, as well as in swaps resolving computational comparison. Its highly competitive matching accuracy is the result of applying the Generalized Wasserstein Distance as a feature dissimilarity measure, which allows matching features without reducing feature spatial information or the dimension of data. Thus, Alignstein is capable of detecting non-obvious signal patterns and finding optimal alignment. Our solution provides a solid basis for further applications of optimal transport theory to the multidimensional problems of automated analysis in mass spectrometry. We hope that the optimal transport-based distances will become a new paradigm as a measure of spectra dissimilarity and will allow the construction of highly effective, robust, and accurate algorithms for mass spectrometry analysis.

Feature dissimilarity measure
The most common approach to comparing mass spectra is a cosine-like similarity score [28,38]. Despite its popularity, this class of scores is not applicable to feature alignment, because they are not scalable with dimension and cannot effectively compare spectra of significantly different molecules [29]. To address these limitations, we propose the Wasserstein distance [39] with additional generalizations [14,40] as a feature dissimilarity measure.
The Wasserstein distance is a metric based on optimal transport theory. It describes how to optimally transform one feature into the other one. These transformations may include shifting the signal as well as splitting or combining the signal between peaks (cf. Fig. 2). Formally, suppose that we have two discrete features µ and ν so that µ(x) is the intensity of µ at M/Z value x. Then we define the transport plan T so that T(x, y) corresponds to the amount of signal that is transported from a peak x of feature µ to peak y of feature ν. The transport cost is the sum of amounts of transport between all pairs of peaks multiplied by the distance between peaks: where d(x, y) is a distance between peaks x and y. For this setup, we have chosen d(x, y) to be 1 distance (a Manhattan distance). The Wasserstein distance W is the minimal transport cost of all possible transport plans T: Besides effectiveness, we observed that Wasserstein distance unsatisfactorily deals with noisy features. To overcome this limitation, we use a generalization of Wasserstein distance (GWD) as proposed by Chizat et al. [40]. GWD differs mainly from Wasserstein distance by the possibility to omit transporting part of the signal with a constant penalty. More specifically, GWD allows omitting the transport of signal on a distance larger than the user-defined λ parameter with a constant penalty proportional to λ and the amount of not transported signal: where Tµ and Tν are the marginals of the transport plan. F is a divergence chosen so that the approximation of the transport plan T to features µ and ν is possible. To compute GWD, we regularize it with the entropic term which allows for fast and numerically stable computation, using a scaling Sinkhorn-Knopp approximation algorithm [41]. Fully formal distance derivation is available in Supplementary Material sections 1-2.

Alignstein algorithm scheme
Alignstein is an algorithm for LC-MS alignment. Here, the alignment is formulated as finding the correspondence of detected features, which represent the same chemical entities (e.g. ions, compounds). Specifically, the algorithm takes chromatograms with detected features as an input, and the outcome of the algorithm is a list of consensus features. Consensus features are sets of corresponding features from distinct chromatograms. The algorithm outline is depicted in Fig. 1 and pseudocode is available in Algorithm 1.
Alignstein starts with feature preprocessing. If the features are not provided by the user, it detects them using the Feature Finder algorithm from the OpenMS package. Features are represented as a set of all signal peaks contained within the boundaries of detected features. For further processing, Alignstein normalizes features and scales the RT so that the RT dimension variation becomes of a similar order of magnitude as the M/Z dimension variation. Scaling is done by dividing the RT by a factor proportional to the ratio of the average feature length (along the RT axis) and the average feature width (along the M/Z axis).
After preprocessing, alignment consists of two phases: the centroid clustering phase and then the feature matching phase. During the former one, centroids of features from all chromatograms are clustered using Mini-Batch K-Means [42] and hierarchical clustering algorithms. Clustering is computed to create candidates for consensus features, which are then verified by the feature matching phase. During this phase, the algorithm searches for pairs of the most similar features across all chromatograms. It is done by finding the feature matching of minimal cost, where the cost is equal to the sum of GWDs between the matched features. We interpret this problem as finding the maximal flow of minimum cost in an appropriately designed flow network, in which we compare iteratively every chromatogram with the rest of the chromatograms. In Fig. 4, we show a flow network for a single chromatogram. The flow of minimum cost is obtained using the network simplex [43] algorithm. This minimization problem has formulated restrictions, such that features may be not matched with a constant penalty, or at most one feature may be matched to features within one cluster. These restrictions assure that features are matched appropriately as required, e.g. the algorithm treats uniformly all input chromatograms, at most one feature from every chromatogram would be chosen to consensus features, etc. Finally, consensus features are created via clusters that have been matched as most similar in optimal matching. A more detailed algorithm description is available in Supplementary Material section 3.
In the special case, when only two chromatograms are aligned, the clustering phase is omitted and consensus features are created by finding the optimal matching between two feature sets (cf. Supplementary Fig. S3).

Implementation details
Alignstein is implemented as a Python 3 package and available at [44]. It uses C++ implementation of GWD in the MassSinkhornmetry package available at [45]. For centroid clustering, we used clustering algorithms implemented in the scikit-learn package [46,47]. For solving the minimum cost flow problem, we used the data structures and algorithms implemented in NetworkX [48,49] package.
We analyzed two proteomic datasets (P1 and P2) and one metabolomic (M1) from the CAAP study. For all sample sets, preparation, and analysis protocols are described in the original study. For metabolomic set as well as for every fraction at different salt bumps (0, 20, 40, 60, 80, and 100 mM ammo- nium chloride) of both proteomic sets, the authors prepared a set of ground-truth consensus features, which represent feature correspondence over chromatograms of significantly high confidence.
To assess the accuracy of alignment, the authors of the CAAP study proposed the generalization of precision and recall as alignment precision and alignment recall. Alignment precision measures how the given ground-truth consensus feature was split over tool consensus features, i.e. it reflects the number of false positives. Alignment recall measures how many features of a given ground-truth consensus feature are found by the algorithm, i.e. it reflects the number of false negatives. Both alignment precision and recall are calculated as an arithmetic mean over all ground-truth consensus features. Furthermore, the authors of SIMA [18] and Wandy et al. [24] proposed the F-score which is the harmonic mean of alignment precision and alignment recall ( 2·P·R P+R , where P is alignment precision and R is alignment recall) to express the balance of alignment precision and alignment recall.
We used input chromatograms as mzML and mzXML files and features as featureXML files provided by authors of the CAAP study. We measured alignment precision and recall using an evaluation script written in R programming language by the authors of this study. Computation was done on a computer with Linux operating system and 24 Intel Xeon E5-2620 2.10 GHz processors + 62 GB RAM. We measured wall time using Linux built-in time command. More details on CAAP benchmarking are provided in the Supplementary Material section 4.

Mussels toxicological response experiment summary
For assessment of Alignstein's ability to detect specific biomarkers, we analyzed chromatograms originally created in the work of Barranger et al. [3]. The original study aimed to measure the effects of polluting the marine mussels (Mytilus galloprovincialis) environment with fullerene (C60) and benzo[a]pyrene (BaP). For this purpose, the authors performed a proteomic analysis.
Mussels were collected in Trebar with Strand, Cornwall, UK, and were exposed in vivo to C60 and BaP at concentrations 0, 5, 50, and 100 µg/L as described in the original study. For proteomic analysis, mussel intestinal proteins were collected. After digestion and purification, the peptides were analyzed by the LC-MS/MS system with data-dependent acquisition (DDA) mode as described in Sequiera et al. [55]. In summary, peptides were separated on Dionex Ultimate 3000 RSLC nanoflow system: Acclaim PepMap C18 nano column 75 µm × 25 cm, 3 µm, 100 Å plus bypass, a linear gradient of 96 % buffer A (0.5 % Acetic Acid) and 4 % buffer B (80 % acetonitrile in 0.5 % acetic acid) to 60 % buffer A and 40 % buffer B, the flow rate of 300 ml/min for 120 minutes. Separated analytes were analyzed in an Orbitrap Velos Pro FTMS (Thermo Finnigan) with positive ion mode ionization with Proxeon nanospray ESI source. In each run, the 10 most abundant ions were further analyzed with additional collision-induced dissociation (CID) fragmentation (30 % collision energy) in a linear ion trap spectrometer. For every BaP concentration from 0, 5, 50, to 100 µg/L three replicates were obtained. Collected chromatograms for all BaP exposure levels were deposited in the ProteomeXchange Consortium PRIDE repository (PXD013805) [56,57].

Data analysis for detection of repeating biomarkers
In downloaded chromatograms, we identified peptides using Comet [58,59]. We obtained the database for peptide identification from the original work (taxa Mollusca, subcategory Bivalvia from Uniprot KnowledgeBase, and contaminants from Global Proteome Machine, www.thegpm.org). The most important Comet search parameters were: peptide mass tolerance of 10 ppm, trypsin as search enzyme, concatenated decoy search, and allowed missed enzyme cleavages no higher than 2.
We detected features in chromatograms using the OpenMS algorithm Feature Finder in the Centroided version. We annotated the detected LC-MS features with MS/MS Comet identifications. Peptide MS/MS identifications were represented in LC-MS by retention time in seconds and the ratio of the precursor neutral mass to the assumed charge. The feature was annotated with identification when LC-MS representation of identification was enclosed within feature boundaries. For further analysis, we considered annotated features.
For calculating IR, we computed the number of repeating identifications over chromatograms. For every repeating identification, we checked if annotated features were properly matched by Alignstein. IR was calculated as a ratio of the number of correctly aligned annotated repeating identifications and the total number of annotated repeating identifications.

Number of swaps estimation
We analyzed two replicates of 0 µg/L BaP concentration in the dataset described in the previous section. Computation was done for all pairs of annotated features with repeating identification in both chromatograms. We computed the fraction of these pairs that were swapped, i.e. a feature pair was considered as a swap when the computed feature RT means of the same identifications in two replicates were in a different order.